Abstract

This paper investigates the use of Hidden Markov models (HMM’s) for
the classification and detection of ocean acoustic events in a
nonstationary ocean background. A statistical formalism is described
for integrating models for dynamic acoustic events and ocean background
into a unified statistical framework. In this framework, both signal
processes and background processes are modeled as HMM’s, and signal
classification is performed by obtaining the likelihood of a corrupted
observation sequence through a combined state space of signal and
background. Techniques are presented for estimating the acoustic event
model parameters from training exemplars that are observed in these
difficult background conditions. Finally, a novel neural network
technique is proposed for the automatic learning of the nonlinear
mechanism through which signal and background observations interact.
Experimental results are presented.

1  Introduction 

The ocean acoustic events that are of interest in this work can
generally be characterized as short duration non-stationary events whose
spectral energy evolves according to some characteristic temporal
structure. The detection and classification of these acoustic events in
an ocean environment is complicated by the presence of background
signals that are not well modeled as traditional wideband or impulsive
noise processes. In fact, the ocean background may itself contain
acoustic events which are similar in nature to those events that we are
trying to detect. Existing techniques that have been developed for ocean
signal classification do not explicitly account for the desired signal
having been observed in this difficult ocean background environment
[1, 2, 3, 4, 5]. Failure to do so, however, can result in severe
performance degradation, especially when a significant mismatch in the
background characteristics exists between the training and testing of
the classifier.

The principal motivation for applying HMM techniques to classification
and detection of acoustic events is that they provide a means for
temporal integration of short-time frame based spectral measurements.
When temporal information is an important part of the signal
representation, as is the case in ocean acoustic events [6], frame based
static classifiers can provide poor event classification performance.
This point was illustrated by a study comparing the performance of
selected static pattern classifiers in classifying vowel sounds as
spoken by a large population of speakers [7]. It was often the case in
this study that a classifier that achieved a very low classification
error rate when classifying independent speech frames achieved a very
high error rate when classifying the entire utterance. Hence, if the
signal model suffers from an impoverished representation of temporal
information in a signal, the performance of the resulting classifier is
also likely to suffer. This issue has been addressed in previous work by
applying heuristic rules [8] or neural networks with time delayed inputs
[9, 2]. Recently, hidden Markov models have been applied with some
success to the problem of ocean acoustic event classification [10, 11].

Figure 1: The observed signal, y, is a composite of signal from
background, z, and signal x.

The principal contribution of this paper is to extend the definition of
the HMM to incorporate the effects of ocean background. By modeling
component sources, the system developed here provides a robust way to
train and classify signals under differing background conditions. In
Section 2, the form of the model is introduced. In Section 3, a maximum
likelihood technique for estimating the acoustic event model parameters
is introduced along with a proposed hybrid neural network approach for
estimating the process of signal corruption by background. Finally, in
Section 4, a set of experiments is performed to evaluate the
effectiveness of the approach in detecting an ocean acoustic event in
the presence of actual ocean background.

2  Modeling  Assumptions 

We define an ocean acoustic event $a^i \in \{A\}$, $i = 1, \ldots, M$,
taken from a set of possible events $\{A\}$. It is assumed that an
acoustic event is produced with prior probability $P(a)$ and that there
is an acoustic channel which produces $D$ dimensional signal vectors,
$X = (x_1, x_2, \ldots, x_T)$, with probability $P(X \mid a)$. We depart
from the traditional HMM model development by assuming that the signal
vectors are observed in the presence of an ocean background process
which gives rise to $D$ dimensional background observation vectors $Z$.
The output sequence $Y = (y_1, y_2, \ldots, y_T)$ is then observed as a
component-wise function of signal and background, $y_t = g(z_t, x_t)$.
We consider those functions $g(\cdot)$ for which the equation
$y = g(z, x)$ defines a one dimensional contour in the $x$-$z$ plane.

Both the signal and the background processes are represented by HMM
models. The choice of the topology of the HMM models that are used for
signal and background was made experimentally. The background HMM is a
4 state fully connected model, and the signal event model is an $N$
state left-to-right model containing $N - 1$ non-null states and a
“null state” that always emits a zero, to account for the “no signal”
condition. The observation probabilities for all states, both signal and
background, consist of single Gaussian densities with diagonal
covariance matrices.

The goal in acoustic event classification is to choose the event
$\hat{a}$ that maximizes $P(a \mid Y)$, which, from Bayes rule,

$$P(a \mid Y) = \frac{P(Y \mid a)\, P(a)}{P(Y)} \tag{1}$$

is equivalent to maximizing $P(Y \mid a) P(a)$. Estimating
$P(Y \mid a)$ is accomplished using a probabilistic HMM to represent the
acoustic event $a$. The prior probability $P(a)$ is estimated from
higher level non-acoustic sources of knowledge. These higher level
hierarchical sources of knowledge have been shown in [6] to be critical
in acoustic event classification by humans. The success of HMM’s in
continuous speech recognition is partly attributable to the ability of
HMM’s to combine these hierarchical sources of knowledge. It is expected
that HMM’s will provide similar benefits in the area of ocean acoustic
event classification.
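
The decision rule of Eq. 1 can be illustrated with a minimal Python
sketch. The function name, the event labels and all numerical values
below are hypothetical placeholders, not values from the experiments; in
practice each log P(Y|a) would be produced by the HMM for event a.

```python
import math

def map_classify(log_likelihoods, priors):
    """Pick the event a maximizing P(Y|a) P(a) and return the posteriors P(a|Y).

    log_likelihoods: dict event -> log P(Y | a)  (assumed to come from an HMM)
    priors:          dict event -> P(a)
    """
    # Work in the log domain: log P(Y|a) + log P(a).
    scores = {a: log_likelihoods[a] + math.log(priors[a]) for a in log_likelihoods}
    best = max(scores, key=scores.get)
    # Dividing by P(Y) = sum_a P(Y|a) P(a) yields the posterior of Eq. 1
    # but does not change which event wins.
    m = max(scores.values())
    log_p_y = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    posteriors = {a: math.exp(s - log_p_y) for a, s in scores.items()}
    return best, posteriors

# Hypothetical case: the likelihood alone favors "E", the prior favors "A".
best, post = map_classify({"A": -120.0, "E": -118.5}, {"A": 0.9, "E": 0.1})
```

With these made-up numbers the prior outweighs the likelihood difference,
which is the role the higher level knowledge sources play in Eq. 1.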


3  Estimating  Model  Parameters 
3.1  Maximum  Likelihood  Formulation 
The ML parameter estimation employed here is based on Rose et al. [12]
and is similar to approaches taken in [13] and [14]. The noise corrupted
observations $Y$ arise from underlying state sequences
$I = (i_1, i_2, \ldots, i_T)$ of the signal HMM, and
$J = (j_1, j_2, \ldots, j_T)$ of the background HMM. The likelihood of
the output sequence given signal model $\lambda$, which consists of the
node dependent mean $\mu_i$ and standard deviation $\sigma_i$ of the
Gaussian HMM observation probabilities $p(x_t \mid i = i_t)$ and
transition probabilities $p(i_t \mid i_{t-1})$, is given as

$$P(Y \mid \lambda) = \sum_I \sum_J \int_C P(X, Z, I, J \mid \lambda)\, dX\, dZ \tag{2}$$

where the summation is over all possible state sequences in the
signal-background state space, and the notation $\int_C$ refers to the
integral along the contour $C_t$ in the signal background observation
space defined by $y_t = g(x_t, z_t)$. Expanding the observation
probability in terms of the joint probability of the hidden state
sequences and hidden data sequences allows us to isolate pertinent terms
relating to the signal model parameters. The complete data likelihood in
Eq. 2 is given as


$$P(X, Z, I, J \mid \lambda) = \prod_{t=1}^{T} p(i_t \mid i_{t-1})\, p(x_t \mid i_t)\, p(j_t \mid j_{t-1})\, p(z_t \mid j_t) \tag{3}$$

Given an initial estimate of the acoustic signal model parameters
$\lambda$, and following the method of Baum et al. [15], it is possible
to find a new set of model parameters $\bar{\lambda}$ such that
$P(Y \mid \bar{\lambda}) \geq P(Y \mid \lambda)$. This is done by
maximizing the auxiliary $Q$ function

$$Q(\lambda, \bar{\lambda}) = \sum_I \sum_J \int_C P(X, Z, I, J \mid \lambda) \log P(X, Z, I, J \mid \bar{\lambda})\, dX\, dZ. \tag{4}$$

In our simulations the probabilities obtained in the forward-backward
algorithm were replaced by a state sequence produced by the process of
Viterbi training. Viterbi decoding in this context involves selecting
the single path through the signal-background state space illustrated by
the diagram in Figure 2 that maximizes $P(Y \mid \lambda)$. In this
case, the summation over all possible state sequences in Eq. 2 is
replaced by a max over all $I$ and $J$.
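
A minimal Python sketch of Viterbi decoding over the combined
signal-background state space follows. It is an illustration under
stated assumptions, not the implementation used in the experiments:
observations are scalar, signal and background combine additively as in
Eq. 7, so that a composite state (i, j) emits y with mean
mu_i + mu_j and variance var_i + var_j, and the function names, model
dictionaries and all numbers are hypothetical.

```python
import numpy as np

def log_gauss(y, mean, var):
    """Log of a scalar Gaussian density, evaluated for arrays of means/variances."""
    return -0.5 * (np.log(2 * np.pi * var) + (y - mean) ** 2 / var)

def product_viterbi(y, sig, bg):
    """Best joint (signal, background) state path for scalar observations y.

    sig, bg: dicts with "logA" (log transition matrix), "logpi" (log initial
    probabilities), "mean" and "var" (per-state Gaussian parameters).
    """
    Ns, Nb = len(sig["mean"]), len(bg["mean"])
    K, T = Ns * Nb, len(y)
    # Composite state k = i * Nb + j; transitions factor as p(i'|i) p(j'|j).
    logA = (sig["logA"][:, None, :, None] + bg["logA"][None, :, None, :]).reshape(K, K)
    logpi = (sig["logpi"][:, None] + bg["logpi"][None, :]).reshape(-1)
    mean = (sig["mean"][:, None] + bg["mean"][None, :]).reshape(-1)
    var = (sig["var"][:, None] + bg["var"][None, :]).reshape(-1)

    delta = logpi + log_gauss(y[0], mean, var)
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + logA              # cand[k, k2]: move from k to k2
        back[t] = np.argmax(cand, axis=0)         # best predecessor of each k2
        delta = cand[back[t], np.arange(K)] + log_gauss(y[t], mean, var)
    path = np.zeros(T, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(T - 1, 0, -1):                 # backtrack
        path[t - 1] = back[t, path[t]]
    return path // Nb, path % Nb, float(np.max(delta))

# Hypothetical models: signal state 0 is a null state that always emits zero.
sig = {"logA": np.log([[0.9, 0.1], [0.2, 0.8]]), "logpi": np.log([0.99, 0.01]),
       "mean": np.array([0.0, 5.0]), "var": np.array([0.0, 1.0])}
bg = {"logA": np.log([[0.95, 0.05], [0.05, 0.95]]), "logpi": np.log([0.5, 0.5]),
      "mean": np.array([0.0, 1.0]), "var": np.array([0.5, 0.5])}
y = np.array([0.1, 0.3, 5.2, 6.1, 5.5, 0.9, 1.1])
sig_path, bg_path, score = product_viterbi(y, sig, bg)
```

The decoded signal path marks the frames attributed to the event, which
is how start and stop boundaries can be read off the lattice.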

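Space does not permit a detailed description of the steps leading to the
expressions for the ML parameter estimates. Taking the partial
derivative of Equation 4 with respect to the signal mean yields the
estimate

$$\mu_{i,ML} = \frac{\sum_t \sum_{j_t} P(i_t, j_t \mid Y, \lambda)\, E\{x_t \mid y_t, i_t, j_t, \lambda\}}{\sum_t \sum_{j_t} P(i_t, j_t \mid Y, \lambda)} \tag{5}$$

where

$$E\{x_t \mid y_t, i_t, j_t, \lambda\} = \frac{\int_C x_t\, p(x_t \mid i_t)\, p(z_t \mid j_t)\, dx_t\, dz_t}{\int_C p(x_t \mid i_t)\, p(z_t \mid j_t)\, dx_t\, dz_t} \tag{6}$$

and $P(i = i_t, j = j_t \mid Y, \lambda) = 1$ if the optimum Viterbi
path passes through states $i$ and $j$ at time $t$, and equals 0
otherwise.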
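
With 0/1 Viterbi weights, Eq. 5 reduces to averaging the conditional
expectation over the frames that the decoded path assigns to each signal
state. The sketch below illustrates this for scalar observations. The
function name and the `cond_mean` callback are hypothetical, and the
callback used in the example (subtracting a known background mean) is a
deliberately simple stand-in for Eq. 6.

```python
import numpy as np

def reestimate_signal_means(y, sig_path, bg_path, cond_mean, n_sig_states):
    """Eq. 5 with Viterbi weights: per-state average of E{x_t | y_t, i_t, j_t}."""
    sums = np.zeros(n_sig_states)
    counts = np.zeros(n_sig_states)
    for y_t, i_t, j_t in zip(y, sig_path, bg_path):
        sums[i_t] += cond_mean(y_t, i_t, j_t)
        counts[i_t] += 1
    # States never visited come back as NaN so the caller can keep the old mean.
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(counts > 0, sums / counts, np.nan)

# Hypothetical example: two signal states, two background states.
bg_means = np.array([0.0, 1.0])
new_means = reestimate_signal_means(
    y=[1.0, 3.0, 5.0, 7.0],
    sig_path=[0, 0, 1, 1],
    bg_path=[0, 1, 0, 1],
    cond_mean=lambda y_t, i_t, j_t: y_t - bg_means[j_t],  # stand-in for Eq. 6
    n_sig_states=2,
)
```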

Figure 2: Multi-dimensional Viterbi lattice. In this example, an HMM for
background (labeled z), containing 4 nodes, and an HMM for a signal
(labeled x) with four nodes, form a Cartesian product with 16 possible
states (or state combinations) per time frame. Viterbi decoding on this
lattice is equivalent to decoding on a standard 2-D lattice with 16
nodes.


1. Estimate $P(i_t, j_t \mid Y, \lambda)$ via Viterbi decoding

2. Estimate the probability density functions for signal and background

   (i) Compute for the decoded “non-signal” observations

   (ii) Compute for each source model

3. Obtain new mapping function based on $g()$ and new source densities

4. Go to 2 until convergence in source densities

5. Go to 1 until overall convergence is achieved

Figure 3: Training algorithm summary.


The contour integral depends on the definition of the noise corruption
function $g()$. While the choice of this function can be very general,
not all choices of $g()$ will lead to a contour integral that has a
closed form solution. For instance, if $g()$ represents an additive
function of signal and background

$$g(x, z) = x + z, \tag{7}$$

then, using the notation $x$ to refer to a single component of the
vector $x$, it can be shown that $E\{x \mid y\}$ of Eq. 5 is

$$E\{x \mid y\} = \mu_x + \frac{\sigma_x^2}{\sigma_x^2 + \sigma_z^2}\,(y - \mu_x - \mu_z) \tag{8}$$

where $\mu_x, \sigma_x^2$ and $\mu_z, \sigma_z^2$ are the means and
variances of the Gaussian densities of the decoded signal and background
states.
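
For the additive case of Eq. 7 with independent scalar Gaussian signal
and background, the conditional mean has the standard linear form
mu_x + var_x / (var_x + var_z) * (y - mu_x - mu_z). The Python sketch
below compares that closed form with a direct numerical evaluation of
the ratio in Eq. 6 along the contour z = y - x. Function names and
parameter values are illustrative assumptions.

```python
import numpy as np

def cond_mean_additive_gauss(y, mu_x, var_x, mu_z, var_z):
    """Closed-form E{x | y} for y = x + z with independent Gaussian x and z."""
    gain = var_x / (var_x + var_z)
    return mu_x + gain * (y - mu_x - mu_z)

def cond_mean_numeric(y, mu_x, var_x, mu_z, var_z, n=20001):
    """Ratio of Eq. 6 evaluated on a uniform grid along the line z = y - x."""
    half_width = 10 * np.sqrt(var_x)
    x = np.linspace(mu_x - half_width, mu_x + half_width, n)
    # Unnormalized p(x) p(z = y - x); normalizing constants cancel in the ratio.
    w = np.exp(-0.5 * (x - mu_x) ** 2 / var_x) * np.exp(-0.5 * (y - x - mu_z) ** 2 / var_z)
    return float(np.sum(x * w) / np.sum(w))

# Hypothetical parameter values.
closed = cond_mean_additive_gauss(2.5, mu_x=1.0, var_x=0.25, mu_z=0.5, var_z=0.75)
numeric = cond_mean_numeric(2.5, mu_x=1.0, var_x=0.25, mu_z=0.5, var_z=0.75)
```

The estimate is linear in y, with a slope that shrinks toward zero as the
background variance grows relative to the signal variance.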


3.2  A  Neural  Net  Mapping  Function 

For spectral parameters, the additive assumption of Eq. 7 is inaccurate.
If the observations are averages of $N$ magnitude squared spectral
frames, then the noise corrupted observation probability is non-central
$\chi^2$ distributed with $2N$ degrees of freedom. No closed form
solution for $E\{x \mid y\}$ has been found for this case; nor have
closed form solutions been found for other important noise corruption
functions. Numerical evaluation of the integrals required for
$E\{x \mid y\}$ can involve costly iterated integrals and series
expansions. For spectrograms, reasonable accuracy can be obtained using
the trapezoidal rule and approximating the non-central density with 50
summands [16]. However, it may be simpler and faster to approximate
$E\{x \mid y\}$ by training the parameters of a general mapping function
using actual observations. Neural nets can be trained using a minimum
mean squared error criterion to approximate a mapping from clean to
noise corrupted observations that converges with infinite data to
$E\{x \mid y\}$ [17]. This is true for a broad class of networks
including the Multi-Layered Perceptron (MLP) trained using
back-propagation, and the Radial Basis Function (RBF). This suggests the
use of an HMM-neural net hybrid system where the HMM provides the
temporal decoding and the neural network performs the mapping from the
space of uncorrupted signal vectors to the space of observable noise
corrupted observation vectors.
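
The idea of learning $E\{x \mid y\}$ from examples under a squared error
criterion can be sketched with the simplest such estimator, a linear map
trained by the LMS rule on simulated pairs. This is an illustration
only: the paper does not state its LMS configuration, so the step size,
number of passes, sample count and Gaussian parameters below are
assumptions, and the additive mixing is used because its true
conditional mean is known for comparison.

```python
import numpy as np

def lms_fit(y, x, step=0.005, epochs=20):
    """Fit x_hat = w * y + b with the LMS (stochastic gradient) update."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for y_t, x_t in zip(y, x):
            err = x_t - (w * y_t + b)   # instantaneous prediction error
            w += step * err * y_t       # gradient step on the squared error
            b += step * err
    return w, b

# Simulated training data (hypothetical parameters): y = x + z.
rng = np.random.default_rng(0)
x = rng.normal(1.0, np.sqrt(0.25), size=2000)   # clean signal
z = rng.normal(0.5, np.sqrt(0.75), size=2000)   # background
y = x + z
w, b = lms_fit(y, x)

# For these parameters the true conditional mean is 0.25 * y + 0.625,
# so w and b should land near 0.25 and 0.625 up to LMS and sampling noise.
```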

3.3  Algorithm  Summary 

The Viterbi lattice that results from the composition of signal and
background HMM’s is multi-dimensional (Fig. 2). Viterbi decoding on a
multi-dimensional lattice is a straightforward extension of the standard
2-D case. The complete algorithm, summarized in Fig. 3, has one
Expectation Maximization (EM) estimator embedded in another. Convergence
is guaranteed since $P(Y)$ cannot decrease at any step in Fig. 3, by
virtue of the properties of Viterbi decoding and ML estimation.


Figure 4: Algorithm of the HMM-neural net hybrid system. The input to
the neural net switches to simulated ocean during neural net training,
and to observed ocean when estimating source model parameters.

However, if neural nets are used to estimate $E\{x \mid y\}$,
convergence in the inner EM estimator is not guaranteed. The system
illustrated in Fig. 4 simulates the ocean environment internally during
training. The internal simulation is necessary to generate training data
for the neural net mapping function. Neural net training for estimating
$E\{x \mid y\}$ has the problem shown in Fig. 5. The neural net does not
perform well in regions with limited training data. These regions arise
from a mismatch between the real and the simulated ocean. This is
illustrated by the noise in the Radial Basis Function (RBF) output for
large $y$'s in Fig. 5. For large values of $y$, $E\{x \mid y\}$ is
nearly linear. A linear mapping, implemented using the LMS algorithm,
does a good job in this case, and in the case of approximating Eq. 8,
which is linear to begin with. However, the linear estimator is
inaccurate for small values of $y$, especially in the second order case
in Fig. 5, where the RBF correctly estimated a slight downward
nonlinearity. The linear LMS estimator is used for the work presented
here.

Figure 5: Scatterplot for 1000 samples of signal (a) vs. its (A) 2nd
order non-central $\chi^2$-distribution (a single frame of
magnitude-squared spectra), and (B) the 8th order case (4 spectral frame
averages). The corresponding estimates of $E\{x \mid y\}$ are plotted
for the LMS (solid line) and RBF (dashed line) techniques.
$\mu_x = 1$, [...] $= .14$, [...] $= .3$, $\sigma^2 = .5$


Figure 6: Time vs. frequency (via wavelet decomposition) for a sample in
the test set. Horizontal ticks mark .1 sec intervals. Solid horizontal
long-dashes on the left side are the Viterbi decoder’s estimates of the
starting and stopping boundaries for signal “A” (a 10 msec broadband
event).

4  Experimental  Results 

The purpose of the preliminary synthetic data experiment described in
this section is to validate the algorithms presented above. The DARPA
Standard Phase I database is designed to test conventional classifiers.
It does not have the adverse background conditions that this algorithm
is designed to handle. Therefore, an adverse condition testing set was
constructed by mixing different parts of the Phase I dataset. The
Phase I signals used were signal “A”, a 10 msec wideband pulse, “E”, a
1 second low frequency tonal, and the “quiet ocean” background. The
testing set consists of 24.3 seconds of data, containing 10 samples of
E, and 20 samples of A. The E samples overlapped A samples 9 times, to
simulate an adverse, highly nonstationary background. The signals were
mixed into the background with a gain of either 0 or -6dB. Fig. 6 shows
a sample from this testing set. The solid horizontal long-dashes at the
left are the Viterbi decoder estimates of the starting and ending times
for signal A events that it detected. The Viterbi decoder used models
that were trained offline (in a high SNR, no interference condition) or
in adverse conditions that were similar to the test set.

Table 1: Summary of experiments performed on additive normal components.
Results labeled with “LMS” were obtained using the neural net.

TRAINING SNR    GAIN    FALSE ALARM RATE (sec^-1)    MISS (% ERR)
high            0       0                            0%
low             0       .041                         0%
low (LMS)       0       0                            0%
high            -6dB    .123                         5%
low             -6dB    .165                         20%
low (LMS)       -6dB    .082                         20%

To evaluate the robustness of the neural net approximation of
$E\{x \mid y\}$, relative to the closed form solution, this first set of
experiments involves mixing signals in background after spectral
decomposition. This was necessary since Eq. 8 works only for additive
normal components, and it is found that spectral components are not well
approximated by Eq. 7.

Tab. 1 summarizes the results on this database. Models were trained
either under similar adverse conditions, or they were trained with no
noise and no other signal (high “training SNR” in Tab. 1). The results
in Tab. 1 show that it is possible to (1) train models of signal A under
adverse background conditions, and to (2) train models of A under high
SNR conditions, and then detect and classify A under adverse background
conditions.*
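
As a reading aid for Tab. 1, the two reported metrics can be written as
simple ratios. The sketch below is an assumption about how the figures
were computed: the paper reports only the rates, and the event counts
used here were chosen because they reproduce the table entries for a
24.3 second test set containing 20 samples of A.

```python
def false_alarm_rate(n_false_alarms, duration_sec):
    """False alarms per second of test data."""
    return n_false_alarms / duration_sec

def miss_percent(n_missed, n_events):
    """Percentage of true events that were not detected."""
    return 100.0 * n_missed / n_events

# Assumed counts: 1, 2, 3 and 4 false alarms over 24.3 s give rates that
# round to the nonzero table entries .041, .082, .123 and .165.
rates = [round(false_alarm_rate(n, 24.3), 3) for n in (1, 2, 3, 4)]

# Assumed counts: 1 and 4 missed events out of 20 give 5% and 20%.
misses = [miss_percent(n, 20) for n in (1, 4)]
```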


* Tab. 1 shows that the use of the neural net (LMS) to estimate
$E\{x \mid y\}$ led to better results than when the closed form solution
is used. This is probably a procedural artifact: the closed form system
was trained to five iterations, yielding the results reported in Tab. 1.
Neural net training was initialized with the final parameters from the
closed form training. $P(Y)$ increased during each of 5 neural net
iterations. Therefore, the results reported for neural net training are
slightly better. Tests using Radial Basis Function (RBF) and
Multi-Layered Perceptron (MLP) had very high false alarm rates, for
reasons discussed above.
5  Summary

The techniques presented here use Hidden Markov Models and the Maximum
Likelihood (ML) formalism to address the issues of temporal structures
and nonstationary backgrounds. Temporal structures provide an important
cue for human acoustic event classifiers [6], but are not well exploited
by static, frame-based classifiers. HMM’s can provide a succinct model
for temporal structures. The ML technique integrates HMM models of
component processes to provide a robust way to handle highly
nonstationary background. In the experiments presented here, signal
models were trained offline under high SNR conditions, and then used to
detect and classify signal events under an adverse, highly nonstationary
background. The experiments also demonstrate an ability to train for
signal parameters when a training set of clean, isolated events is not
available.